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Recent technological advances have made it possible to simulta- 
neously measure multiple protein activities at the single cell level. 
With such data collected under different stimulatory or inhibitory 
conditions, it is possible to infer the causal relationships among pro- 
teins from single cell interventional data. In this article we propose 
a Bayesian hierarchical modeling framework to infer the signaling 
pathway based on the posterior distributions of parameters in the 
model. Under this framework, we consider network sparsity and model 
the existence of an association between two proteins both at the over- 
all level across all experiments and at each individual experimental 
level. This allows us to infer the pairs of proteins that are associ- 
ated with each other and their causal relationships. We also explicitly 
consider both intrinsic noise and measurement error. Markov chain 
Monte Carlo is implemented for statistical inference. We demonstrate 
that this hierarchical modeling can effectively pool information from 
different interventional experiments through simulation studies and 
real data analysis. 

1. Introduction. Cells respond to internal and external changes through 
signaling networks. One major research area in biology is to identify signal- 
ing proteins and understand how they coordinate to function properly. With 
recent technological advances in genomics and proteomics, researchers now 
can monitor and quantify molecular activities at the genome level, making 
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it possible to reconstruct signaling pathways from these high-throughput 
data. Although efforts have been made to use microarray gene expression 
data and sequence data to reveal signaling pathways [e.g., Liu and Ringner 
(2007)], these data are limited in two important aspects. First, signaling 
pathways function at the protein level, so measured gene expression levels 
from microarrays at most can provide a proxy to the protein activity levels. 
Second, each cell may behave differently from other cells due to complex 
interactions among many proteins, some substantially. Therefore, popula- 
tion level data collected by microarrays can mask individual cell differences, 
making it difficult to infer underlying pathways. In contrast, single cell level 
protein activity data offer much richer information for pathway inference. 

Flow cytometry [Herzenberg et al. (2002); Perez and Nolan (2002)] is 
a powerful fluorescence-based technology that can make rapid, sensitive, and 
quantitative measurements of multiple proteins for thousands of individual 
cells. It can measure both a specific protein's expression level and protein 
modification states such as phosphorylation. Therefore, phospho-protein re- 
sponses to environmental stimulations can be monitored at the single cell 
level for thousands of cells very efficiently, and this technology has been 
employed to infer signaling pathways through gathering activity levels of 
multiple proteins under different stimulatory or inhibitory conditions [Sachs 
et al. (2005)]. We focus on the analysis of single cell flow cytometry data in 
this article. 

Several methods have been applied for network inference based on ge- 
nomics data, including Bayesian Networks (BNs) [Pe'er et al. (2001); Pe'er 
(2005)], Markov Networks (MNs, also called Markov random fields) [Wei 
and Li (2007, 2008)], and Dependency Networks (DNs) [Heckerman et al. 
(2000)]. Common to all these methods, each protein (or gene) is represented 
by a node and a dependency between two proteins is represented by an edge 
in the network. More formally, we define a graph Q = (V, E) with its nodes 
V = {1, . . . , P} and an edge set E. We use Xi to refer to the value of the 
zth node, that is, the expression level of the zth protein. The methods differ 
in how the edges are inferred from the observed data. In BN, the network 
is a directed acyclic graph where the state of each node only depends on its 
immediate ancestors. This structure imposes Markovian dependency among 
all the nodes stating that each variable is conditionally independent of its 
nondescendants given its parent variables. So the joint likelihood for all the 
nodes, that is, proteins, can be factored into a product of conditional proba- 
bilities. BNs pose significant computational challenges to learn the network 
structure because the model space to be explored is super-exponential in 
the number of genes to be studied. More recently, Ellis and Wong (2008) 
proposed a method to reduce the bias in the fast mixing algorithm pro- 
posed by Friedman and Killer (2003) to sample the BN structures from 
the posterior distribution. MNs are undirected graphical models and are 
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similar to BNs in representation of dependencies: each random variable is 
conditionally independent of all other variables given its neighbors. Gaus- 
sian Graphical Models (GGMs) [Lauritzen (1996); Schafer and Strimmer 
(2005); Dobra et al. (2004)], a subclass of MNs, assume a multivariate nor- 
mal distribution as the joint distribution of random variables. The existence 
of an undirected edge in a GGM is implied by the nonzero partial correla- 
tion coefficient derived from the precision matrix. Some studies have found 
that BNs outperform GGMs in inferring networks based on interventional 
data where the biological system is perturbed through designed experiments, 
but GGMs may perform better for observational data, for example, Werhli, 
Grzegorczyk and Husmeier (2006). DNs aim to reduce computational bur- 
den where a large number of genes are modeled by building a collection 
of conditional distributions separately. DNs define the conditional distribu- 
tions {p(Xi\X_i)} separately for each X{. When we focus on sparse normal 
models, DNs define a set of P separate conditional linear regression models 
in which X, is regressed on a small selected subset of predictor variables, 
which are determined separately. 

Because a statistical association between two variables only implies asso- 
ciation not causation, standard DN and GGM approaches cannot be used 
to infer causal networks, a goal in signaling pathway analysis. In this arti- 
cle we develop a Bayesian hierarchical modeling approach based on DNs to 
address this limitation. This is achieved through appropriate intervention 
experiments to dissect directional influences. To accommodate varying rela- 
tionships among proteins under different experimental conditions, we allow 
a different set of regression models for each condition. At the same time, the 
hierarchical framework imposes similar functional forms across conditions 
to borrow information from different experiments. As for causal inference, 
the basic idea is that for any protein i, its regulators exert similar effects 
if it is not intervened, and would have no effect when i is controlled. In 
contrast to standard regression models where the predictors are assumed to 
be error-free, our model allows measurement errors in predictor variables. 

A large part of the difficulty in the standard BN computation is due to the 
requirement that the network be acyclic. Our approach is not guaranteed 
to give acyclic networks. However, in terms of sensitivity and specificity to 
detect true edges, our method is competitive with the best methods that 
impose the acyclic graph assumption. This is illustrated by our results in 
the example of Sachs et al. (2005). 

The paper is organized as follows. In Section 2 we describe two hierarchical 
models (a general hierarchical model where no constraints are imposed and 
a restricted hierarchical model where a symmetry constraint is imposed) 
and the methods for statistical inference of the network. For comparison, we 
also describe a nonhierarchical model where all the experiments are pooled 
together for analysis. Then we investigate the performance of these methods 
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on simulated data in Section 3. In Section 4 we apply these methods to data 
from a study of the signaling networks of human primary naive CD4 + cells 
[Sachs et al. (2005)]. We finish the paper with discussions in Section 5. 

2. Methods. The primary goal of our statistical model is to infer causal 
influences among proteins from interventional data. In this section we de- 
scribe three models: a hierarchical model (HM), a restricted hierarchical 
model (RHM), and a nonhierarchical model (NHM), that can be used to 
infer the relationships among proteins. We also discuss statistical methods 
to infer causal networks in this section. 

2.1. Hierarchical model (HM). First we discuss a Bayesian hierarchical 
model to infer the relationships among proteins both at the overall level 
across all experiments and under individual experimental conditions. Our 
model incorporates both measurement errors and the intrinsic noises due to 
the biological process and unmodeled biological variations. 

Let P denote the number of proteins, K denote the number of experimen- 
tal conditions, and Nk denote the number of samples (individual cells) under 
the /cth condition, k = 1, 2, . . . , K. We further let Xi n k denote the true activity 
level of the ith protein in the nth cell under the feth experimental condition, 
and Xi n k the measured value of its activity, where Xi n k = Xink + £ flk^ an< ^ 
the measurement error ef^ k is a normal random variable with mean and 
standard deviation a M , that is, £^ k ~ N(0, (a M ) 2 ). Our model assumes that 
there exists a linear relationship among the activity levels of proteins. That 
is, for each protein i = 1, 2, . . . , P and for each condition k = 1, 2, . . . ,K, 



where e\ nk is the intrinsic noise and has a normal distribution N(0, (of) 2 ). 2 



of the measurement errors {e^}- I n equation (1), a\- = if there is no 

linear relationship between the activity levels of proteins i and j under the 

(k) 

kth experimental condition. A nonzero value of a- implies the existence of 
a linear relationship (but not necessarily a causal effect). To correctly infer 
the network among these proteins, we need to first find, for each protein, the 
subset of proteins that are linearly associated with its expression level, which 
is implied by the set of nonzero coefficients in (1). The linear relationship 



2 Here a constant variance (erf) 2 is assumed for the intrinsic noises of a particular 
protein. We can relax this assumption and allow varying variances (cr/ fe ) 2 for intrinsic 
noises under different experimental conditions. This extended model and simulation results 
are described in Supplementary Material SI [Luo and Zhao (2010)]. 



(1) 
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among the true expression values Xi n k implies that the observed values are 
also linearly related: 

x ink = a4) a ij ^ ( x jnk ~ e j n k) + £ ink + e ink 

(2) 

~~ "iO ' / ; ii x ink ' fc ink * t ink ij jnk- 

3+i 3+i 

Comparing (1) and (2), we can see that correctly inferring the relationship in 
the network depends on the correct inference of the set of nonzero coefficients 
in (2). 

For each protein, we utilize indicator variables Zij = 0/1 to denote the 
relationship between proteins i and j such that z^j = 1 if and only if the 
coefficient of the jth protein in the regression model for the ith protein is 
nonzero. The values of may differ under different experimental conditions. 
For example, if protein j regulates protein i, z^ = 1 when i is not controlled 
and the association strength between the two proteins should be similar 
under such conditions. However, z^ = when i is controlled because the 

relation between Xi and Xj is destroyed. Therefore, it is natural to use a hie- 

(k) 

rarchical structure to formalize this thinking. We use w\j to denote the 
probability that z^ = 1, that is, protein j is related to protein i under the kth 

(k) 

experimental condition. The prior we take for the regression coefficient a\j 
is a mixture of two distributions. One is a point mass at zero, indicating 
that the jth protein is not linearly related to the ith protein under the 
kth condition. The other is a normal distribution for nonzero effects, with 
weight Wjj. Specifically, the prior for the slope coefficient {j ^ i) is 

(3) a^VS^^^-a-^^^^ + ^N^I^'K) 2 )' 
where <5o(-) indicates a point-mass at zero, and w$ is the probability that 

a[j 7^ 0. When ^ 0, the prior for a\j is N(ajj, (afj) 2 ) with a common 
mean and a common standard deviation across different experimental condi- 

(k) 

tions. Under this setup, information is shared for coefficients a\j across dif- 

(k) 

ferent conditions. Similarly, we borrow information for w\j across different 

(k) 

experimental conditions by applying a beta distribution as a prior for w\a 
with a common mean Wij and a common variance — Wij)/[vij + 1): 

(4) w\j\wij,Vij ~ Bet&(wijVij, (1 - Wij)vij). 

(k) 

So w\j measures the probability that there is an association between pro- 
teins i and j under the kth experimental condition, and Wij measures the 
overall- level probability that the two proteins are associated. 
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To complete the model, we specify a beta distribution Beta(/3i , fa) for Wij , 
a normal distribution N(qW,tW) for and gamma distributions: 6(71,72), 
6(73,74) and G(75,7@) for (cr,[)~ 2 , {&fj)~ 2 and (<7 A/ )~ 2 , as their respective 
prior distributions. In the simulation studies, we take 7, = fij = 1 for i = 1, 2 
and j = 1, 2, . . . , 6, a® = and r (i) = 1000 for i = 1, . . . , P. In real data 
analysis, we vary the hyperparameter values to study the sensitivity of the 
inference results to these values. We note that the posterior distribution is 
proper since we take proper priors for all the parameters. 

(k) 

One attractive feature of this model is that when is integrated out, 
the marginal distribution of a\j is independent of Vij : 

(5) a\f iCj j .<\j j .(T'>: ~ (1 - Wij)S (alf) + WijN(alf \oiij, (erg) 2 )- 
Given and Wy, when is specified, the posterior distribution of is 

(6) \wij,Vij,a\y ~ Beta(wijVij + I(a\f / 0), (1 - + I(a\f = 0)). 

Hence, we can first sample the posterior distributions of a\z and w^j, and 

then sample wff according to equation (6). 

Under this model, the inference of the causal network consists of two steps. 
First, based on the posterior means wuj\ of the overall-level probability 
0.5 x (wij +Wji), we infer whether there is an association between proteins i 
and j with a certain threshold ui. Second, for each pair of proteins (i,j) 
that are inferred to be associated, we determine their regulatory direction 

(k) 

based on the experiment-level probabilities to infer the causal network. 
The underlying assumption of our inference is that for a pair of proteins 
(i and j) that has a regulatory relation, say, i regulates j (i—tj), controlling 
(inhibiting or activating) over protein j affects the activity of j but not i, 
resulting in much reduced or lack of association between i and j; controlling 
over protein i affects the activity of i and hence j, keeping the association 

(k) !k) 

between them. The posterior distributions of w\j given a- and Wij, with 

prespecified, is given in equation (6). To better reflect the changes of 
for different k, we use =0.1 in our analysis, because larger values of 

(e.g., 10) are not able to reveal the changes in as the parameters in (6) 

are dominated by when Vy is large, and I{ot~-j ^ 0) plays a smaller role 
in (6). To put this into more concrete terms, we consider Wij = 0.9 as an 
example, which gives a strong support for the association between proteins i 
and j. The difference between the distributions Beta(0.9 x 10 + 1,0.1 x 10) 
and Beta(0.9 x 10, 0.1 x 10 + 1) when = 10 is much less than that between 
Beta(0.9 x 0.1 + 1, 0.1 x 0.1) and Beta(0.9 x 0.1, 0.1 x 0.1 + 1) when v {j = 0.1. 
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To determine the directions of edges, we calculate the posterior means w - 
of Wjj for all k and pairs. For each pair (i, j), if all the values in a stream 

(e.g., {Wji}k) are small (less than a threshold 112 > 0, so the signal in this 
stream is weak compared to noises), then we ignore this stream and infer 
the causal relations only based on the other one {{w^}k)- The inference is 

based on checking whether under specific conditions decreases greatly 
compared to the highest value. Let Sij = {k : k £ {i,j}} denote the set of 
conditions under which i or j is perturbed, and \Sij\ be its cardinality. 
We propose the following four criteria to determine the causal relationship 
between an associated protein pair {i, j): 

• Case 1: \Sij \ = 1, that is, i or j is only perturbed in one condition. With- 
out loss of generality, we suppose that protein i is controlled under con- 
dition k' (Sij = {A/}). If max.k{w[j } — ^ > U3 for a threshold U3 > 0, 
then from stream {w\j }k we infer j — > i. Otherwise, we infer i — > j. Sim- 
ilarly, we make an inference from the stream {wj^}k- If the directions 
inferred from both streams are the same, say, i—tj, we infer that direc- 
tion as the direction of the edge between i and j : If the directions 
from both streams are different, we say that the direction of the edge is 
undetermined. Taking the conditions in Table 1 for the network in Fig- 
ure 1 as an example, pairs (1,2), (1,8), (2,6) (3,4), (4,5), (6,8), (8,10), 
and (8,11) belong to Case 1. 

• Case 2: [5^1 > 1 and for all k € Sij, the same protein, say, i, is controlled. 

For each stream, for example, {w^}k, if maxfc{u)j^} — ' > U3 for all 

k' 6 Sij, then we infer j — > i; if max.k{w^ } — ^ < 113 for all k' G Sij, 
then we infer otherwise, we do not infer a direction from this stream. 

If both streams lead to a directional inference and the directions are the 



Table 1 

A summary of the nine experimental conditions for the data in Sachs et al. (2005) 





Stimulus 


Effect 


1 


CD3, CD28 


general perturbation 


2 


ICAM2 


general perturbation 


3 


Akt-inhibitor 


Inhibits Akt 


4 


G0076 


Inhibits Pkc 


5 


Psi 


Inhibits Pip2 


6 


U0126 


Inhibits Mek 


7 


Ly 


Inhibits Akt 


<s 


PMA 


Activates Pkc 


9 


p 2 cAMP 


Activates Pka 



<s 
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Fig. 1. Pathway adapted from Sachs et al. (2005) by including three missed edges and 
correcting one reversed edge. Nodes represent proteins, and directed edges represent signal 
transduction. 



same (Figure 4, top panel), or if only one stream provides a directional 
inference, then we infer the direction of the edge. Otherwise, the direction 
is undetermined. For the conditions in Table 1, pairs (1,9), (3,9), (5,7) 
(6,7), (9,10), (9,11) belong to Case 2. 

• Case 3: \Sij\ > 1 and both proteins are controlled in the experiments. 

d) 

Let Sj- ■ denote the set of conditions under which protein i is controlled, 

(i) 

and S- ■ the set of conditions under which protein j is controlled. For 

each stream, for example, {w^}k, we calculate the differences of 

when % or j is controlled: d^ 1 = w^ 1 ' — w^ 2 ^ for each k\ E sfj and 

k 2 E S!f] . If d\j lk2) > u 3 for all h E sf] and k 2 E S$ , we infer that i -> j; if 

df 5 ^> < -ua for all h € 5£ and k 2 E S% , we infer that j — > i; otherwise, 
the direction is undetermined from this stream. If both streams lead to 
a directional inference and the directions are the same, or if only one 
stream provides a directional inference, then we infer the direction of the 
edge. Otherwise, the direction is undetermined. For the conditions in Ta- 
ble 1, pairs (2,9), (4,9), (7,8) (8,9) belong to Case 3. 

• Case 4: \Sij\ = 0, that is, no perturbation is conducted on either protein. 
In this case, we cannot infer the causal relation. 
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The choices of the thresholds u\, 112, and 113 will be discussed in sim- 
ulation studies. Generally speaking, the two steps involved in causal net- 
fit) 

work inference are based on the posterior distributions of Wij and w\j . The 
overall-level probability measures the strength of the linear relationship 
between two proteins across all the conditions. Based on Wij, we infer the set 

of proteins that are related from which we determine an undirected graph. 

(k) 

The changes in the experiment-level probabilities offer insights on the 
directions of causal regulations. 

2.2. Restricted hierarchical model (RHM). The Wij and in (3), (4), 
and (6) denote the probability that protein j is included in the linear model 
to predict the activity level of protein i across all the experiments and under 
the kth specific condition, respectively. In this framework, we may impose 
the constraint that Wji = Wij and = for each k, that is, the existence 
of a linear relationship between proteins i and j is independent of which 

variable is the predictor and which is the response variable. We can infer 

(k) 

the posterior distributions of and under this constraint, and call 
this model a restricted hierarchical model (RHM). Based on the posterior 
means of Wij , we can infer whether proteins i and j are associated with each 
other by setting up an appropriate threshold u^. For each associated pair, 

(k) 

we can infer the causal relationship according to the changes in w\j . The 
choice of the threshold will be illustrated in Section 3.1 and Supplementary 
Material S3 [Luo and Zhao (2010)] details the criteria in determining the 
causal relations for the associated pairs of proteins. Different from HM, we 
must prespecify Vij in RHM to sample from the posterior distributions of Wij 

(k) 

and . We will show how different values of affect the network inference 
in the following discussion. 

2.3. Nonhierarchical model (NHM). To demonstrate the usefulness of 
the hierarchical model approach, we also consider a nonhierarchical model 
(NHM) as a reference model for comparisons. The NHM assumes a linear 
model among the activity levels of proteins and incorporates both measure- 
ment errors and intrinsic noises as in equation (2). The main difference is 
that this NHM assumes identical regression coefficients across different ex- 
perimental conditions: 

(7) Xi n k = OiiQ + ^ ] OiijXjnk + £j„fc + £j n /u — ^ ] ajj£j n ^, 




where the intrinsic noise e\ nk follows the normal distribution N(0, (<r/) 2 ), the 
measurement error ef^ k follows the normal distribution N(0, (a M ) 2 ), and 
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they are assumed to be independent. As in HM, we also apply mixture 
distributions as priors for the coefficients a«: 

(8) aij ~ (1 - Wij)5 (aij) + WijN(aij\a,T 2 ). 

The posterior distributions of Wij provide information about whether pro- 
teins i and j are associated. However, it is impossible to make causal infer- 
ence from this model. 

For all three models, we use MCMC methods to sample the posterior 
distributions. Supplementary Material S2 [Luo and Zhao (2010)] provides 
details of the MCMC updates for HM. The MCMC updates for RHM and 
NHM are similar and not shown in this paper. 

3. Simulation study. We first apply our methods to simulated data to 
illustrate how to infer the causal network from the posterior distributions of 

(k) 

the overall-level probabilities and the experiment-level probabilities w\- , 
for both HM and RHM. We also study how the inference differs between 
these two methods and for different choices of Vij . We then study the perfor- 
mance of our methods on simulated data with heavy tail distributed intrinsic 
noises. 

We simulate data based on the network shown in Figure 1, which is 
adapted from Sachs et al. (2005) by correcting one reversed edge and in- 
cluding three missed edges. From Figure 1, we can derive the parent set 
for each node (protein). For any protein i, we first generate the associa- 
tion strength from the uniform distribution over the interval [0.5,2], 
and randomly assign the sign of ctij. Given the activities of its parents, 
we simulate the activity Xi of protein i from the normal distribution: Xi ~ 
N(aio + Ylj a ij%j 1 { a i) 2 )i where the sum extends over all parents of protein i. 
Thus, we get the empirical distribution of x% when protein i is not intervened. 
Let Xi denote the observed expression level of protein i, then Xi is simulated 
from N(aij, (a M ) 2 ). We simulate the interventional data as follows. For an 
intervention experiment, if the ith protein is inhibited, we sample xi from 
the left tail of its empirical distribution obtained when protein i is not per- 
turbed, beyond the 5th percentile. If the ith node is stimulated, we sample Xi 
from the right tail of the empirical distribution, beyond the 95th percentile. 
We simulate a total of nine stimulatory or inhibitory interventional con- 
ditions, as summarized in Table 1. Under each perturbation condition, we 
simulate expression levels for each of the 11 proteins for 600 individual cells. 
We consider two cases: (1) constant intrinsic variances (<j{) 2 = 1 and (2) 
variable intrinsic variances with of = 0.1 x yiG(2,l), where IG(2, 1) repre- 
sents the inverse gamma distribution with mean 1 and variance oo. Finally, 
we simulate data where the intrinsic noises are sampled from a heavy tail 
distribution: i(l), which represent a central t distribution with one degree 
of freedom. 
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Fig. 2. Posterior means of (wij +Wji)/2, sorted in increasing order, from one 

MCMC run of the HM on the simulated data with constant intrinsic variances. Large 
solid and small empty circles represent true and false associations, respectively. 



I\2 



3.1. Constant intrinsic variance: (07) 



1. 



3.1.1. Inference from HM. Based on the simulated data, we obtain sam- 
ples for both Wij and wu from their posterior distributions under HM. To 
infer whether an association exists between proteins i and j, we obtain the 
posterior means wuj\ of the average of the probability that each is included 
in the regression model of the other: (w^ +Wji)/2. Higher values of wuj) 
imply stronger evidence of association between the two proteins. Figure 2 
shows the posterior means wuj\, from one MCMC run, for each pair (i, j) 
(i < j), in the ascending order of wuj\. Large solid circles represent true as- 
sociations, and small empty ones represent false ones. We see that the true 
associations dominate the higher values of wu^y 

To infer the pair of proteins that are associated, we need to set a thresh- 
old u\ on the posterior means wuj\ so that those above the threshold are in- 
ferred to be associated. The permutation study 3 offers an over-liberal thresh- 
old (<0.1), based on which we get over 40 associations with false positive 
rate >0.5. Noting the jumps in the plot of wuj\, we propose to choose the 
threshold where a big jump occurs. Setting the threshold u\ as any value 
between 0.2 and 0.4 and choosing the pairs with wufi > u\, we get 22 asso- 
ciations with 2 false positives. When we have multiple MCMC runs, which 



3 We permute the observations for each protein and then analyze the permuted data 
with HM. The obtained posterior means wuj) are less than 0.1 for all pairs. 
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lead to multiple plots of wuj), we can combine the inferences from them. 
Figure SI in the Supplementary Material [Luo and Zhao (2010)] draws the 
plots of wu j\ from four additional MCMC runs. They show the same fea- 
tures as seen in Figure 2 that true associations tend to have high 
values and jumps exist in these plots. These five MCMC runs lead to quite 
similar results: from four of them we get 22 associations with 2 false posi- 
tives, and from a fifth run we get 21 associations with 2 false positives and 1 
missing association, when we choose u\ between 0.3 and 0.4. Let ut be the 
relative frequency that each association is selected. When ui € (0.3,0.4) and 
Uf > 4/5, we get 22 associations with 2 false positives (Figure 3). 

For the pairs of proteins that are inferred to be associated, we then infer 
their causal directions based on the criteria listed in Section 2.1. To bet- 
ter illustrate the criteria, we give two examples in Figure 4. The top panel 

(k) 

draws the boxplots of the samples from the posterior distributions of w^ 7 




Fig. 3. Networks inferred by choosing associations with u\ £ (0.3,0.4), 112 = 0.1, 
uz £ (0.3,0.5), and Uf > 0.8 in five MCMC runs of the HM on the simulated data with 
constant a\ . Solid arrowed lines represent correctly inferred true edges, dashed thick lines 
with labels "u" represent edges whose directions cannot be determined from the simula- 
tions, dashed arrowed thin lines with labels "r" represent reversed edges, and dotted lines 
with labels "+" represent false positive edges. 
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Fig. 4. Box-plots of the samples from the posterior distributions of w^j and ra^' (top 
panel), aig*' and Wjg (bottom panel) when Vij =0.1 for all i and j. This is from one 
MCMC run of the HM on the simulated data with constant af . 

(k) 

and u> 7 5 • Both show that the experimental-level probabilities greatly de- 
creased under conditions 3 and 7 where protein 7 (Akt) is inhibited (here 

maxfcju;^} — wl^ ^ > 0.8 and m.axk{w 7 k ^} — w 7 \ ^ = 0.9 for k' = 3, 7). So we 
infer the direction 5—7-7 (i.e., PIP3 — > Akt). The bottom panel tells a dif- 
ferent story. The posterior means of w^q when k = 3 or 7 are much smaller 
than others (maxfcju)^} — w 7 q ^ = 0.9), indicating the causal relation 6 — > 7, 
but w^j keeps the same level under all conditions (maxfc{zi)g^} — w^ 7 ^ = 0), 

(k) 

indicating the causal relation 7 — > 6. The contradictory results from Wq 7 

(k) 

and w 7 Q lead to the failure in determining the causal relationship between 
proteins 6 (Erk) and 7. Taking = 0.1 and € (0.3, 0.5), we infer a causal 
network as shown in Figure 3, which contains 14 true directed edges, 5 edges 
whose directions are undetermined, 1 reversed edge, and 2 false edges. 

When we have multiple MCMC runs, we infer the causal relation of each 
edge based on the majority vote of the directions inferred from each MCMC 
run. In fact, these five runs lead to almost identical causal inference for the 
common associations [based on u\ G (0.3,0.4)] when we take U2 = 0.1 and 

U3 £ (0.3,0.5). The choices of U2 and U3 are affected by the value of v^. 

(k) ~ (k) 

Choosing v) „• =0.1 ensures that most w)j are either above 0.9 or below 0.1. 

The streams with w^' < 0.1 for all k contain too weak a signal to provide 
sufficient information for causal inference. So we take U2 = 0.1. The small 
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Fig. 5. Inference from RHM with Vij — 0.1 on the simulated data with constant <j\. Left: 
posterior means u)uj) ofniij, sorted in increasing order. Right: inferred networks with 
u[ — 0.2, it3 £ (0.3,0.5). Solid arrowed lines represent correctly inferred true edges, dashed 
thick lines with labels "u" represent edges whose directions cannot be determined from the 
simulations, dashed arrowed thin lines with labels "r" represent reversed edges, and dotted 
lines with labels "+" represent false positive edges. 



value of v\- also leads to a great difference in w\- for different experiments 
when protein i or j is intervened. In this simulation study, intervention of the 

(k) 

child node for one edge leads to a decrease of at least 0.5 in w\j . Any value 
of U3 in (0.3, 0.5) leads to the same directional inference for the inferred 
associations. 

(k) 

3.1.2. Inference from RHM. RHM requires that Wij = Wji and = 

(k) 

for all i, j, and k. This restriction aims at avoiding the nonconsistent 

(k) (k) 

directional inferences based on and w\a separately as HM does. We plot 
the posterior means of in Figure 5 where we take = 0.1. Similar 
to Figures 2 and SI, true associations tend to have higher values of w^y 
Setting u'i = 0.2, we infer 22 associations with 2 false positives. Applying 
the criteria listed in Supplementary Material S3 [Luo and Zhao (2010)], we 
infer the causal network as shown in Figure 5. Compared to the network in 
Figure 3, RHM leads to a network with 16 true directed edges, 3 edges whose 
directions are undetermined, 1 reversed edge, and 2 false edges when we take 
u[ = 0.2 and U3 6 (0.3,0.5). If we increase the threshold v! x to a value where 
there is a big jump, for example, 0.3, we will miss 1 true directed edge. 

When a bigger value = 10 is applied, the differences of the posterior 
means W(ij-) of w^j become much smaller between the true and false asso- 
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Fig. 6. Posterior means Wuj\ of (wij +Wji)/2, sorted in increasing order, from one 
MCMC run of NHM. Small empty and large solid circles represent the false and true 
associations, respectively. 



ciations (Figure S2 in the Supplementary Material [Luo and Zhao (2010)]). 
This together with the fact that bigger values of Vy lead to smaller changes 
in experimental level probabilities results in our conclusion that a small Vij 
is preferred for causal network inference. 

3.1.3. Inference from NHM. Ignoring the effect of perturbations on sig- 
naling pathway, NHM assumes a common coefficient in the linear regres- 
sion models across all experimental conditions. From this model, we can only 
infer whether there is an association between two proteins. Similar to the 
inferred posterior means from HM, utaj) from NHM also tend to take higher 
values for true associations (Figure 6), but with two differences. First, the 
range of uiU j) from NHM is smaller. In other words, compared to HM, NHM 
leads to smaller values of the biggest u>Uj-\ , and larger values of the small- 
est j) . So the support for true associations and the evidence against false 
associations are weaker. Second, the dominance of high values of true associ- 
ations is not as strong as that from HM. More false associations take higher 
values of wuj) than the hierarchical inference. If we take 0.6 as a threshold, 
we infer 23 associations with 15 true and 8 false. Taking 0.45 as the thresh- 
old, we recover all the true associations, but 27 false ones are also inferred. 
More importantly, we cannot determine the directions of associations from 
NHM because perturbation information is not utilized in this model. 



16 



R. LUO AND H. ZHAO 



3.2. Variable intrinsic variances (<j() 2 ■ We then consider the case where 
variances of intrinsic noises vary for different proteins. In this case, both HM 
and RHM with Vij = 0.1 clearly separate the true associations from the false 
ones in the plot of the posterior means wuj) (Figure 7). The causal networks 
inferred from both models are the same, with 18 correctly inferred true edges, 
1 reversed edge, and 1 edge whose direction is undetermined [u\ € (0.2,0.7), 
U2 = 0.1, and M3 = 0.3]. As in Section 3.1.2, RHM with a bigger value = 10 
leads to association inference with bigger false positive rate and smaller 
changes in experimental level probabilities when a child node is perturbed 
(Figure S2 in the Supplementary Material [Luo and Zhao (2010)]). NHM is 
not applied here and thereafter since it does not provide causal relations. 




10 20 30 40 50 10 20 30 40 

indices of pairs indices of pairs 




Fig. 7. Inference results for the simulated data with variable intrinsic variances. Upper 
panel: posterior means w^j) ofwij, sorted in increasing order from HM (left) and RHM 
with Vij = 0.1 (right). Lower panel: inferred networks from both models with ui £ (0.2,0.7) 
and «3 = 0.3. Solid arrowed lines represent correctly inferred true edges, dashed thick lines 
with labels "u" represent edges whose directions cannot be determined from the simulations, 
and dashed arrowed thin lines with labels "r" represent reversed edges. 
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Table 2 

Summary of pathway inference in simulation study 



Data 


Methods 


True 


Undetermined 


Reversed 


Missing 


False 


Hamming 
distance 


Data-1 


HM 


14 


5 


1 





2 


8 




RHM 


16 


3 





1 


2 


6 


Data-2 


HM 


18 


1 


1 








2 




RHM 


18 


1 


1 








2 


Data-1 4 


HM 


9 


4 


1 


6 


4 


15 


Data-2 4 


HM 


14 


1 





5 


6 


12 



All are based on ui = 0.4 and U3 = 0.3. The hamming distance is the minimum number 
of simple operations needed to go from the inferred graph to the true graph. Here simple 
operations include adding or removing an edge, and adding, removing, or changing the di- 
rection of an edge. Data-1: simulated data in Section 3.1 with constant intrinsic variances. 
Data-2: simulated data in Section 3.2 with varying intrinsic variances. Data-1 4 : simulated 
data with parameter settings in Data-1 and intrinsic noises sampled from t(l). Data-2 4 : 
simulated data with parameter settings in Data-2 and intrinsic noises sampled from t(l). 

3.3. Heavy tail distribution for intrinsic noise. Considering the possibil- 
ity of nonnormality for real biological processes, we simulate data where the 
expression levels of proteins have heavy tail distribution. This is realized 
by simulating e\ nk ~t(l) for each protein i under each experimental condi- 
tion k. We reuse the parameter settings in Sections 3.1 and 3.2 so that the 
performance of our methods on the normal and nonnormal cases can be eas- 
ily compared. We summarize the network inference results in Table 2. Due 
to the model misspecification when we use HM to analyze these heavy tail 
distributed data, we infer networks with more false positive and false neg- 
ative edges. Therefore, our current model needs to be extended to analyze 
heavy tail data. 

4. Case study. The Mitogen-Activated Protein Kinase (MAPK) path- 
ways transduce a large variety of external signals, leading to a wide range of 
cellular responses such as growth, differentiation, inflammation, and apopto- 
sis. External stimuli are sensed by cell surface markers, then travel through 
a cascade of protein modifications of signaling proteins, and eventually lead 
to changes in nuclear transcription. Single cell interventional data of 11 well- 
studied proteins from the MAPK pathways were originally generated by 
Sachs et al. (2005) using the intracellular multicolor flow cytometry tech- 
nique. This pathway was perturbed by 9 different stimuli, each targeting 
a different protein in the selected pathway (Figure 1 and Table 1). Sachs 
et al. (2005) applied Bayesian network analysis to infer the causal protein- 
signaling network. Correcting the bias in the commonly used algorithm pro- 
posed by Friedman and Killer (2003), Ellis and Wong (2008) reanalyzed this 
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data set through sampling BN structures from the correct posterior distri- 
bution. Both studies used the discretized data where the protein expression 
levels were grouped into three levels: "low," "middle," and "high." The in- 
hibited molecules were set at "low" values, and activated molecules were set 
to level "high." We apply our method to this data and compare the results 
with those from Ellis and Wong (2008). 

We infer the networks using HM and RHM with Vij = 0.1 and Vij = 10. 
Each analysis has five MCMC runs. Figure 8 shows the inferred posterior 
means wuj\ in one MCMC run (more can be found in Supplementary Fig- 
ures S4 ~ S6 [Luo and Zhao (2010)]), and the inferred networks from five 
MCMC runs, from each method. We use the same symbols as in simulation 
studies to indicate true or false inferences, where the "true" network is taken 
to be the network in Figure 3 of Sachs et al. (2005), which is the current 
understanding of this pathway. 

Compared to HM, RHMs lead to fewer true associations with high values 
of wuj) (v^ = 10) or smaller gaps of w^j^ between most true and false 
associations (v^ = 0.1). Taking the threshold u\ = 0.2 and requiring Uf > 0.6 
in five runs of HM, we get 21 associations, with 5 missing edges and 6 false 
positives. Requiring u[ = 0.11 and uj > 0.6 in RHM with = 0.1, we get 
19 associations, with 5 missing edges and 4 false ones. The threshold 0.11 
exceeds the value (0.1) from the permutation study by only a small amount, 
implying that RHM offers weaker support to true associations than HM. 
Setting u[ = 0.994 and u f > 0.6 in RHM with v^j = 10, we only get 14 
associations, with 8 missing and 2 false associations. 

From RHM, we can only correctly infer the directions of five or six 
edges. The causal relations for most inferred associations can not be deter- 
mined. But HM leads to a better result: 9 true directed edges, 1 direction- 
undetermined, 4 reversed, 6 missed, and 4 false edges, under the thresholds 
u\ = 0.2, ii2 = 0.1, = 0.3, and Uf > 0.8 (Figure 8 and Table 3). This in- 
ferred network is comparable with that from Ellis and Wong (2008), which 
contains 9 true directed edges, 3 reversed, 8 missed, and 6 false edges. The 



Table 3 

Summary of the inferred networks applying different methods to the real data 







True 


Undetermined 


Reversed 


Missing 


False 


Hamming distance 


HM 




9 


1 


4 


6 


4 


15 


RHM 


= 0.1 


6 


8 


1 


5 


4 


18 


RHM 


= 10 


5 


5 


2 


8 


2 


IT 


mHM 




6 


5 


1 


8 


4 


18 


BN 




9 





3 


8 


6 


17 



Here mHM denotes the modified model described in Supplementary Material SI [Luo and 
Zhao (2010)] which models the varying variances of intrinsic noises. 



HMS FOR SIGNALING PATHWAY INFERENCE 



19 




10 20 30 40 50 

indices of pairs 




10 20 30 40 50 

indices of pairs 




10 20 30 40 50 

indices of pairs 



Fig. 8. Inference results for the real data. From top to bottom: HM, RHM with Vij = 0.1, 
and RHM with Vij = 10. In networks, solid arrowed lines represent correctly inferred true 
edges, dashed thick lines with labels V represent edges whose directions cannot be deter- 
mined from the simulations, dashed arrowed thin lines with labels "r" represent reversed 
edges, dotted arrowed thick lines represent missing edges, and dotted thin lines with labels 
"-/-" represent false positive edges. 
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Hamming distances of these two networks to Figure 1 are 15 and 17, respec- 
tively. These results are summarized in Table 3. 

In MCMC analysis, we take = jj = 1 for % = 1, 2 and j = 1, . . . , 6. To 
check the sensitivity of HM, we also consider other values: 7, = 0.1 or 100, 
and /3j = 0.1 or 0.0001. Taking /3j = 0.1, we get a network (not shown) with 
9 true directed edges, 1 direction-undetermined, 4 reversed, 6 missed, and 6 
false associations. Other values of the hyperparameters result in 1~3 fewer 
true associations, and at least 3 fewer true directed edges. All these results 
are based on 5,500,000 iterations of MCMC updates in each run, which take 
about 20 hours on a node with an Intel(R) Xeon(R) 3 GHz CPU and a 16G 
memory. 

5. Discussion. We have proposed hierarchical statistical methods to infer 
a signaling pathway from single cell data collected from a set of perturbation 
experiments. The advantage of this method is that it provides a more explicit 
framework to relate the activity levels of different proteins. In our models, 
we assume that the activity level of each protein is linearly associated with 
a small subset of other proteins under each condition. Using a Bayesian hi- 
erarchical structure, we model the existence of an association between two 
proteins both at the overall level and at the experimental level. The overall- 
level probabilities measure the strength of associations between any two 
proteins across all experiments. The experimental-level probabilities reflect 
the changes of associations between proteins under different conditions. Our 
inferential procedure consists of two steps. First we infer the existence of an 
association between any pair of proteins based on the overall-level probabil- 
ities. Then for those pairs of proteins inferred to be associated, we infer the 
directions of the causal relations based on the changes in the experimental 
level probabilities. The basic rationale in our causal inference is that for 
two associated proteins, controlling over the target molecule destroys the 
association, while perturbing the regulatory molecule does not. 

We consider hierarchical models with (RHM) and without (HM) the re- 

(k) (k) 

striction that Wij = Wji and w\- = for each k. For RHM, we have to 
specify the hyperparameter Vij prior to MCMC analysis. We have considered 
the inference results when the value of is set at 0.1 and 10. Higher values 
of lead to higher ranges of the inferred overall-level and experimental- 
level probabilities, and smaller changes in experimental-level probabilities. 
In HM, the experimental-level probabilities can be integrated out, so the 
posterior inference of other parameters is independent of v^j. Hence, the 
choice of associations, which is based on the overall-level probabilities, is 
independent of v^. We only need to specify in the causal inference. To 
better reflect the changes of the experimental-level probabilities, we suggest 
smaller values for v^, for example, Vij =0.1. Both HM and RHM perform 
well in simulation studies. 
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We need to choose thresholds to infer the causal network: u\ for associ- 
ation inference and U2 and U3 for causal directional inference. Noting the 
jumps in the plots of wuj), we propose to choose the threshold U\ where 
there are great differences in sorted wnj\. This is easily determined when 
variations in the data are well captured by the proposed hierarchical models 
(e.g., Figures 2 and 7). If there are no great differences in the sorted overall- 
level probabilities (e.g., Figure 8), one may decide the number of edges to be 
included and then choose the top ones. Threshold U2, which is taken as 0.1 
in our study, can be chosen based on the experimental-level probabilities of 
those unassociated pairs of proteins. Threshold U3 is closely related to Vy, 

~ (k) 

which measures the variability in in that a smaller leads to greater 

- (k) 

variabilities in w\j between the experiments when the target protein is and 

is not intervened. When Vij =0.1, a difference of 0.3 in w\- is enough to 
show the effect of intervening the target protein. 

Compared to the nonhierarchical model, hierarchical models have at least 
two advantages. First, the hierarchical structure allows information borrow- 
ing across different experiments while allowing for differences among exper- 
iments, leading to a more clear-cut inference on whether two proteins are 
related. Second, this modeling framework allows us to infer causal relation- 
ships between proteins from the presence and absence of the association 
across different perturbation conditions. Overall, our proposed hierarchical 
modeling provides a general framework for inferring networks from high- 
throughout data. 

There are several possible ways of extending this model. In Supplemen- 
tary Material SI [Luo and Zhao (2010)] we modify HM by incorporating 
varying variances of intrinsic noises under different experimental conditions. 
The modified model does not outperform HM in our simulation study. It 
is interesting to investigate when the varying variances of intrinsic noises 
are not ignorable and incorporating them improves the network inference. 
We also find in our simulation study that applying our methods to data 
where intrinsic noises are sampled from heavy tail distributions results in 
power loss in pathway inference. Therefore, there is a need to extend this 
hierarchical structure to model nonnormal data. 

SUPPLEMENTARY MATERIAL 

Additional descriptions and results of hierarchical models 

(DOI: 10.1214/10-AOAS425SUPP; .pdf). Materials include description and 
simulation results of the hierarchical model (mHM) with varying variances 
of intrinsic noises (cf^) 2 , MCMC algorithm for the hierarchical model (HM), 
direction inference for the restricted hierarchical model (RHM), and addi- 
tional figures of posterior inference and networks. 
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